[...] execute computer program instructions out of program order. According to
an embodiment of the present invention each of a plurality of threads (5, 6, 7)
is associated with a respective data structure (9, 10, 11) comprising a number
of bits (12) that correspond to [...] comparing the data structure of the
thread with the data structures of other threads on which the thread may depend.

COLLISION HANDLING APPARATUS AND METHOD

FIELD OF THE INVENTION

The present invention relates in general to execution of computer program 
instructions, and more specifically to thread-based speculative execution of 
computer program instructions out of program order. 

BACKGROUND OF THE INVENTION 

The performance of computer processors has been tremendously enhanced
over the years. This has been achieved both by making operations faster and
by increasing the parallelism of the processors, i.e. the ability to execute
several operations in parallel. Operations can for instance be made faster by
improving transistors to make them switch faster or by optimizing the design
to minimize the levels of logic needed to implement a given function.
Techniques for parallelism include processing computer program instructions
concurrently in multiple threads. There are programs that are designed to
execute in several concurrent threads, but a program that is designed to
execute in a single thread can also be executed in several concurrent threads.
If the execution of a program in several concurrent threads causes program
instructions to be executed in an order that differs from the program order in
which the program was designed to execute, the thread execution is
speculative. The discussion hereinafter focuses on such speculative thread
execution.

A computer program that has been designed to be executed in a single thread
can be parallelised by dividing the program flow into multiple threads and
speculatively executing these threads concurrently, usually on multiple
processing units. The international patent application WO 00/29939 describes
techniques that may be used to divide a program into multiple threads.

However, if the threads access a shared memory, collisions between the
concurrently executed threads may occur. A collision is a situation in which
the threads access the shared memory in such a way that there is no guarantee
that the semantics of the original single-threaded program is preserved.

A collision may occur when two concurrent threads access the same memory
element in the shared memory. An example of a collision is when a first thread
writes to a memory element and the same memory element has already been
read by a second thread which follows the first thread in the program flow of
the single-threaded program. If the write operation performed by the first
thread changes the data in the memory element, the second thread will have
read the wrong data, which may give a result of program execution that differs
from the result that would have been obtained if the program had been executed
in a single thread. Depending on the implementation, collisions can for example
also occur when two threads write to the same memory element in the shared
memory.

Execution of a computer program in multiple concurrent threads is intended
to speed up program execution without altering the semantics of the program.
It is therefore of interest to provide a mechanism for detecting collisions.
When a collision has been detected, one or more threads can be rolled back in
order to make sure that the semantics of the single-threaded program is
preserved. A rollback involves restarting a thread at an earlier point in
execution and undoing everything that has been done by the thread after that
point. In the example above, in which the older first thread wrote to a memory
element that had already been read by the younger second thread, the second
thread should be rolled back, at least to the point when the memory element
was read, if it is to be guaranteed that the semantics of the single-threaded
program is preserved.

A known mechanism for detecting and handling collisions involves keeping
track of accesses to memory elements by associating two or more flag bits per
thread with each memory object. One of these flag bits is used to indicate
that the memory object has been read by the thread, and another bit is used to
indicate that the memory object has been modified by the thread.

The international patent application WO 00/70450 describes an example of
such a known mechanism. Before a primary thread writes to a memory
element in a shared memory, status information associated with the memory
element is checked to see if a speculative thread has read the memory element.
If so, the speculative thread is caused to roll back so that the speculative
thread can read the result of the write operation.

A disadvantage of this known mechanism when implemented in software is
that it results in a large execution overhead due to the communication and
synchronization between the threads that is required for each access to the
shared memory. The status information is accessible to several threads and a
locking mechanism is therefore required in order to make sure that errors do
not occur due to concurrent access to the same status information by two
threads. There is also a need for memory barriers (also called memory fences)
in order to ensure correct ordering between accesses to the shared memory
and accesses to the status information.

Another example of a known mechanism for detecting and handling collisions
is described in Steffan J.G. et al., "The Potential for Using Thread-Level Data
Speculation to Facilitate Automatic Parallelization", Proceedings of the Fourth
International Symposium on High-Performance Computer Architecture,
February 1998, and in Oplinger J. et al., "Software and Hardware for
Exploiting Speculative Parallelism with a Multiprocessor", Stanford University
Computer Systems Lab Technical Report CSL-TR-97-715, February 1997. An
extended cache coherency protocol is used to support speculative threads.
The flag bits are, according to this technique, associated with cache lines in a
first level cache of each of a plurality of processors. When a thread performs a
write operation, a standard cache coherency protocol invalidates the affected
cache line in the other processors. By extending the cache coherency protocol
to include the thread number in the invalidation request, the other processors
can detect read after write dependence violations and perform rollbacks if
necessary. A disadvantage of this approach is that speculatively accessed cache
lines have to be kept in the first level cache until the speculative thread has
been committed, otherwise the extra information associated with each cache
line is lost. If the processor runs out of available positions in the first level
cache during execution of the speculative thread, the speculative thread has to
be rolled back. Another disadvantage is that the method requires modifications
to the cache coherency protocol implemented in hardware, and cannot be
implemented purely in software using standard microprocessor components.

SUMMARY OF THE INVENTION

As mentioned above, the known mechanisms for handling and detecting
collisions have some disadvantages. The problem solved by the present
invention is to provide mechanisms that simplify handling and detection of
collisions.

A first object of the present invention is to provide a device having simplified
mechanisms for recording information regarding memory accesses to a shared
memory.

A second object of the present invention is to provide a simplified method for
recording information regarding memory accesses to a shared memory.

A third object of the present invention is to provide a simplified method for
handling possible collisions between a plurality of threads.

The objects of the present invention are achieved by means of an apparatus
according to claim 1, by means of a method according to claim 17 and by means
of a method according to claim 27. The objects of the invention are further
achieved by means of computer program products according to claim 36 and
claim 37.

According to the present invention each of a plurality of threads is associated
with a respective data structure for storing information regarding accesses to
the memory elements of the shared memory. When a thread accesses a selected
memory element in the shared memory, information indicative of the access to
the selected memory element is stored in its associated data structure.
According to an embodiment of the present invention, collision detection is
carried out after the thread has finished executing, by comparing the data
structure of the thread with the data structures of other threads on which the
thread may depend.

An advantage of the present invention is that each thread is associated with a
respective data structure that stores the information indicative of the accesses
to the shared memory. This is especially advantageous in a software
implementation since each thread will only modify the data structure with which
it is associated. The threads will read the data structures of other threads, but
they will only write to their own associated data structure according to the
present invention. The need for locking mechanisms is therefore reduced
compared with the known solutions discussed above, in which the information
indicative of memory accesses was associated with the memory elements of the
shared memory and was modified by all the threads. The reduced need for
locking mechanisms reduces the execution overhead and makes the
implementation simpler. In the software implementation, the absence of locks
and memory barriers during thread execution will also give a compiler more
freedom to optimize the code.

Another advantage of the present invention is that, since it does not require a
modified cache coherency protocol, it can be implemented purely in software,
thus making it possible to implement the invention using standard components.

Further advantages of embodiments of the present invention will be apparent
from the following detailed description of preferred embodiments with
reference to accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a schematic block diagram of a computer system in which the present
invention is used.

Figs. 2A and 2B are schematic diagrams that illustrate a computer program being
executed in a single thread and divided into several threads respectively.

Fig. 3A is a schematic block diagram that illustrates how data structures
according to the present invention are used.

Fig. 3B is a schematic block diagram that illustrates how an alternative
embodiment of data structures according to the present invention is used.

Fig. 4 is a flow diagram illustrating how reading from the shared memory may be
performed according to the present invention.

Fig. 5 is a flow diagram illustrating how writing to the shared memory may be
performed according to the present invention.

Fig. 6 is a schematic block diagram that illustrates dependence lists associated
with threads according to the present invention.

Fig. 7 is a flow diagram illustrating how a thread may be executed and a collision
check for the thread may be made according to the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Figure 1 illustrates a computer system 1 including two central processing units
(CPUs), a first CPU 2 and a second CPU 3. The CPUs access a shared memory 4,
divided into a number of memory elements m0, m1, m2, ... mn. A memory
element may for instance be equal to a cache line or may alternatively
correspond to a variable or an object in a source language. Figure 1 also shows
three threads 5, 6, 7 executing on the CPUs 2, 3.

A thread can be seen as a portion of computer program code that is defined by
two checkpoints, a start point and an end point. Figure 2A shows a schematic
illustration of a computer program 8 comprising a number of instructions or
operations, i1, i2, ... in. When the computer program is executed as a single
thread, the normal way of processing the instructions is in the program order,
i.e. from top to bottom in Figure 2A. It is however possible, according to known
techniques as mentioned above, to divide the program into multiple threads.
The program 8 may for instance be divided into the three threads 5, 6, 7 as
indicated in Figure 2A. The threads can be executed concurrently. Figure 2B
illustrates an example of a threaded program flow, where the first CPU 2 first
processes the thread 5 and then the thread 6, and the second CPU 3 starts
processing thread 7 before the threads 5 and 6 have finished executing on the
first CPU 2.

Figure 2B shows an example of how the threads 5, 6, 7 may execute. Many other
alternative ways of executing the threads are however possible. It is for instance
not necessary that the first CPU 2 finishes processing the thread 5 before
starting on the thread 6, and the thread 6 may be executed before the thread 5.
The first CPU 2 may be a type of processor that is able to switch between
several different threads such that the CPU 2 e.g. starts processing the thread 5,
leaves the thread 5 before it is finished to process the thread 6 and then returns
to the thread 5 again to continue where it left off. Such a processor is sometimes
called a Fine Grained Multi-Threading Processor. A Simultaneous
Multi-Threading (SMT) Processor is able to process several threads in parallel,
so if the CPU 2 is such a processor it is able to process the threads 5, 6
simultaneously. Thus, it is not necessary to have multiple CPUs in order to
process multiple threads concurrently.

Collisions may occur between the threads 5, 6, 7 when the instructions of the
computer program 8 are executed out of program order. As mentioned above, a
collision is a situation in which the threads access the shared memory 4 in such
a way that there is no guarantee that the semantics of the original
single-threaded program 8 is preserved. It is therefore of interest to provide
mechanisms for detecting and handling collisions that may arise during
speculative thread execution.

According to the present invention each thread 5, 6, 7 is associated with a data
structure 9, 10, 11, which is illustrated schematically in Figure 1. The data
structure is used to store information indicative of which memory elements in
the shared memory 4 the respective thread has accessed. According to an
embodiment of the present invention each data structure includes a number of
bits 12 that correspond to the memory elements in the shared memory.
According to the embodiment of the present invention shown in Figure 1 the
bits 12 of each data structure 9, 10, 11 are divided into a load vector 9a, 10a, 11a
and a store vector 9b, 10b, 11b. For each memory element m0, m1, m2, ... mn in
the shared memory 4, there is exactly one corresponding bit 12 in the load
vector and exactly one corresponding bit 12 in the store vector associated with
each thread. When the thread 5 reads from a memory element, it sets the
corresponding bit 12 in the load vector 9a to indicate that the memory element
has been read. The store vector 9b is updated analogously when the thread 5
writes to the shared memory.

There can either be a one-to-one correspondence or a many-to-one
correspondence between the memory elements and the bits in the load and store
vectors. By having a many-to-one correspondence, the memory overhead is
reduced at the cost of spurious collisions, which cause slower execution.
Reducing the memory overhead will however also result in reduced execution
overhead, since there will be fewer cache misses. A hash function can be used to
map the number of a memory element to a bit position in the load and store
vectors.
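
The load and store vectors described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the class name
AccessRecord, its method names, the use of an integer as a bit vector and the
default vector size of 8 are all hypothetical, and the remainder hash is just
one possible choice of hash function.

```python
class AccessRecord:
    """Per-thread record of shared-memory accesses (hypothetical name)."""

    def __init__(self, size=8):
        self.size = size   # number of bit positions per vector (assumed)
        self.load = 0      # load vector, kept as an integer bit mask
        self.store = 0     # store vector, kept as an integer bit mask

    def position(self, element):
        # Hash function: here simply the remainder modulo the vector size.
        return element % self.size

    def mark_read(self, element):
        # Set the load-vector bit that corresponds to the element.
        self.load |= 1 << self.position(element)

    def mark_written(self, element):
        # Set the store-vector bit that corresponds to the element.
        self.store |= 1 << self.position(element)


rec = AccessRecord(size=8)
rec.mark_read(3)
rec.mark_written(11)  # 11 % 8 == 3, so element 11 shares bit 3 with element 3
```

With a many-to-one mapping such as this one, elements 3 and 11 become
indistinguishable in the vectors, which is the source of the spurious
collisions mentioned above.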

Figure 3A illustrates an example of how the data structures 9, 10, 11 are used
according to the present invention. In this example the thread 5 has written to
the memory elements m1 and m4 and read memory elements m1, m5 and m8.
The thread 6 has written to the memory elements m1, m6 and m9 and read the
memory elements m2, m6 and m13. The thread 7 has read the memory element
m12. In this example, there are more memory elements in the shared memory
than there are bit positions in the load and store vectors, which means that there
is a many-to-one correspondence between the memory elements and the bits in
the load and store vectors. In this example the bit position in the load and store
vector that corresponds to a selected memory element is found using a hash
function, which in this example simply calculates the remainder when dividing
the number of the memory element by the size of the load and store vectors.
This means that when the thread 5 writes to the memory element m1, it sets the
bit in position number 1 in its store vector and when the thread 6 writes to the
memory element m9, it sets the bit in position number 1 in its store vector.
When the threads have performed the write and read operations mentioned
above, the bit position numbers that are set will be 0, 1, 5 for the load vector 9a;
1, 4 for the store vector 9b; 2, 5, 6 for the load vector 10a; 1, 2, 6 for the store
vector 10b and 4 for the load vector 11a. This is illustrated in Figure 3A by
means of filled boxes representing the bits that are set.
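
The remainder hash of this example can be reproduced with a few lines of
Python. The sketch assumes a vector size of 8, which is not stated in the text
but is inferred from the listed positions (m8 maps to position 0 and m9 to
position 1); the function name positions is hypothetical. Only vectors whose
accesses and listed positions can be recomputed directly are shown.

```python
SIZE = 8  # assumed vector size, inferred from m8 -> 0 and m9 -> 1


def positions(elements, size=SIZE):
    """Return the sorted bit positions set for the given element numbers."""
    return sorted({element % size for element in elements})


load_9a = positions([1, 5, 8])     # thread 5 reads m1, m5 and m8
store_9b = positions([1, 4])       # thread 5 writes m1 and m4
load_10a = positions([2, 6, 13])   # thread 6 reads m2, m6 and m13
load_11a = positions([12])         # thread 7 reads m12

print(load_9a, store_9b, load_10a, load_11a)
# prints [0, 1, 5] [1, 4] [2, 5, 6] [4]
```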

The implementation of the present invention can be simplified by means of the
data structures 9, 10, 11 each comprising a single combined load and store
vector instead of a separate load vector and a separate store vector. Figure 3B
illustrates the same example as described above with reference to Figure 3A,
with the only difference that the data structures 9, 10, 11 each include a single
combined load and store vector 9c, 10c, 11c instead of the load vectors 9a, 10a,
11a and the store vectors 9b, 10b, 11b. The bit positions that are set in the
combined load and store vector 9c correspond to a logical bitwise inclusive OR
operation of the load vector 9a and the store vector 9b shown in Figure 3A.

The embodiment of the present invention wherein the data structures include a
single combined load and store vector results in an increased number of
spurious collisions, but on the other hand it also results in a reduced need for
memory to store the data structures and a reduced number of operations when
checking for collisions, as will be discussed further below.

The embodiments of the present invention shown in Figures 3A and 3B use a
type of data versioning called privatisation, which means that a private copy 14
of a memory element that is to be modified is created for the thread that
modifies the element. The thread then modifies the private copy instead of the
original memory element in the shared memory. The private copies contain
pointers 15 to their corresponding original memory element in the shared
memory. The private copies are used to write over the original memory elements
in the shared memory 4 when the threads for which they were created are
committed. If a thread is rolled back, its associated private copies 14 are
discarded. Figure 4 shows a flow diagram illustrating how reading from the
shared memory is performed when privatisation is used. Figure 5 shows a
corresponding flow diagram for writing to the shared memory.

Figure 4 shows a first step 20, wherein the memory element to be read is marked
as read in the load vector. In step 21, it is examined whether or not the thread
has a private copy of the memory element to be read. If a private copy exists the
data is read from the private copy, step 22. If there is no private copy the data is
read from the memory element in the shared memory, step 23.

Figure 5 shows a first step 25, wherein it is examined whether or not the thread
has a private copy of the memory element to be written to. If there is no private
copy, the memory element to be written to is marked as written in the store
vector, step 26, and a private copy is created, step 27. The data is then written to
the private copy, step 28. If a private copy is found to exist in step 25, the data
can be written to the private copy directly, step 28, without having to make a
mark in the store vector or create the private copy.
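
The read and write flows of Figures 4 and 5 can be sketched as follows. This is
a minimal single-process illustration, not a definitive implementation: the
class name SpeculativeThread, the modelling of the shared memory as a Python
list, the dictionary of private copies (whose keys stand in for the pointers
15) and the remainder hash with size 8 are all assumptions. Clearing the
vectors on rollback is likewise an assumption about how a restart might be
prepared.

```python
class SpeculativeThread:
    """Privatising reads and writes for one thread (hypothetical sketch)."""

    def __init__(self, shared, size=8):
        self.shared = shared   # shared memory, modelled as a list
        self.size = size       # assumed vector size
        self.load = 0          # load vector as an integer bit mask
        self.store = 0         # store vector as an integer bit mask
        self.private = {}      # element number -> private copy

    def read(self, element):
        self.load |= 1 << (element % self.size)            # step 20
        if element in self.private:                        # step 21
            return self.private[element]                   # step 22
        return self.shared[element]                        # step 23

    def write(self, element, value):
        if element not in self.private:                    # step 25
            self.store |= 1 << (element % self.size)       # step 26
            self.private[element] = self.shared[element]   # step 27
        self.private[element] = value                      # step 28

    def commit(self):
        # Private copies overwrite the original memory elements.
        for element, value in self.private.items():
            self.shared[element] = value
        self.private.clear()

    def rollback(self):
        # Private copies are discarded; the vectors are reset (assumption).
        self.private.clear()
        self.load = self.store = 0


memory = [10, 20, 30, 40]
thread = SpeculativeThread(memory)
thread.write(2, 99)            # goes to a private copy, memory[2] stays 30
seen = thread.read(2)          # reads the private copy
untouched = memory[2]
thread.commit()                # now memory[2] is overwritten
```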

The privatisation described above is not a prerequisite of the present invention.
Another type of data versioning, which may be used instead of privatisation,
involves the threads storing backup copies of the memory elements before
they modify them. These backup copies are then copied back to the shared
memory during a rollback.

The embodiments of the present invention described above comprise data
structures in the form of bit vectors for storing information indicative of the
threads' accesses to the memory. However, many alternative types of data
structures for storing this information are possible according to the present
invention. The data structures may for instance be implemented as lists to which
numbers that correspond to the memory elements are added to indicate accesses
to the memory elements. Other possible implementations of the data structures
include trees, hash tables and other representations of sets.

It will now be discussed how the thread associated data structures of the present
invention can be used to check for and detect collisions.

In a software implementation where the thread associated data structures of the
present invention are used to check for collisions, a thread that has collided with
another thread will itself detect the collision. In the known mechanisms
discussed above an older thread would detect if a younger thread has collided
and send a message about this so that the younger thread would be rolled back.
This sending of messages takes time and causes an extra delay, which can be
avoided by means of the present invention.

According to a preferred embodiment of the present invention collision checks
are performed after the thread has finished its execution and is about to be
committed. The collision check is made by comparing the data structure
associated with the thread to be checked with the data structures associated
with other threads on which the thread to be checked may depend. In order to
keep track of the possible dependencies between threads, a dependence list may
be created for each thread before it starts executing. This is illustrated in
Figure 6, by means of the threads 5, 6, 7 which are associated with dependence
lists 16, 17 and 18 respectively. The dependence lists are lists of all older threads
that had not yet been committed when the thread was about to start executing.
The thread 7 may depend on threads 5 and 6 so its dependence list 18 contains
references to threads 5 and 6 to indicate the possible dependency.
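
Creating a dependence list can be sketched as below. The function name
make_dependence_list is hypothetical, and the sketch assumes that threads are
identified by numbers that increase in program order, so that a smaller number
means an older thread.

```python
def make_dependence_list(new_thread, uncommitted):
    """Return all older, not yet committed threads (assumed ordering by id)."""
    return [thread for thread in uncommitted if thread < new_thread]


# Threads 5 and 6 have not been committed when thread 7 is about to start.
dependence_list_18 = make_dependence_list(7, [5, 6])
print(dependence_list_18)  # prints [5, 6]
```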

The dependence list described above is just an example of how to keep track of
possible dependencies between threads. The dependence list is not limited to a
list structure but can also be represented as an alternative structure that can store
information regarding possible dependencies. It is further not necessary for the
dependence list to store a reference to all older not yet committed threads. For
example in an implementation where forwarding is used it may be possible to
determine that the thread to be started is not dependent on some of the older
not yet committed threads and it is then not necessary to store a reference to
these threads in the dependence list. In other cases the information stored in the
dependence list may refer to an interval of threads of which some already have
been committed when the dependence list is created. As long as the dependence
list includes a reference to all the threads that the thread to be started depends
on there is no harm in the dependence list also including references to some
threads that the thread to be started clearly does not depend on.

Figure 7 shows a flow diagram of how a thread may be executed and a collision
check for the thread may be made according to the present invention. In a step
30, the dependence list for the thread to be executed is created. The thread is
then executed in a step 31. When the thread has finished executing, it waits until
the threads that it may depend on have been checked for collisions and are ready
to be committed, step 32. It then compares its associated data structure to the
data structures associated with the threads in the dependence list to check for
collisions, step 33. If no collision is detected, the thread is committed in a step
34, otherwise the thread is rolled back in a step 35. If the thread has collided
with another thread, the risk that the thread collides with the same thread again
may be reduced by delaying the restart of the thread until the thread it
collided with has been committed. The system may be arranged to give higher
priority to committing threads with which other threads have collided.
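
The flow of Figure 7 can be sketched in a simplified, sequential form. The
function name run_thread, the representation of each data structure as a
(load, store) pair of integer bit masks and the returned strings are
assumptions made for illustration. The waiting of step 32 is not modelled,
because the sketch assumes that all older threads have already been checked,
and the comparison used in step 33 is the separate-vector check of older
stores against younger loads and stores.

```python
def run_thread(body, depends_on, records):
    """Follow steps 30 to 35 for one thread (simplified sketch).

    body       -- callable that executes the thread and returns (load, store)
    depends_on -- identifiers of older threads the thread may depend on
    records    -- identifier -> (load, store) of already checked threads
    """
    dependence_list = list(depends_on)        # step 30
    load, store = body()                      # step 31
    # Step 32 (waiting for the older threads) is omitted in this sketch.
    for older in dependence_list:             # step 33
        _old_load, old_store = records[older]
        if old_store & (store | load):
            return "rollback"                 # step 35
    return "commit"                           # step 34


# Older thread 5 has load bits 0, 1, 5 and store bits 1, 4 set.
records = {5: (0b100011, 0b010010)}
first = run_thread(lambda: (0b000100, 0), [5], records)   # reads bit 2 only
second = run_thread(lambda: (0b000010, 0), [5], records)  # reads bit 1
print(first, second)  # prints commit rollback
```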

When the collision check is performed as described above, even the oldest not
yet committed thread is speculative, since it might have collided with an earlier
thread that already has been committed and this is not detected until the thread
has finished its execution. However, when a thread has become the oldest not
yet committed thread, it will have to be rolled back at most once, since when
it is restarted, there is no other thread that it can collide with.

Alternatively one or several partial collision checks may be performed during
execution, before performing the collision check when the thread has finished
executing. The partial collision check can be performed without locking the data
structures associated with other threads because it is acceptable that the partial
check fails to detect some collisions. Collisions that were not detected in the
partial collision check will be detected in the final collision check that is
performed after the thread has finished its execution.

The comparison between two data structures to detect collisions is performed
differently depending on whether the data structures include separate
load and store vectors or a combined load and store vector. If the data
structures have separate load and store vectors the comparison between the
load and store vectors of an older and a younger thread can be carried out by
performing the following logical operations bitwise on the bit vectors:

old store vector AND (young store vector OR young load vector).

If the resulting vector contains any bits that are set there is a collision and the
younger thread should be rolled back. If the data structures have combined load
and store vectors the corresponding logical operation to be performed to check
for collisions is an AND-operation between the combined vector of the older
thread and the combined vector of the younger thread.
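
Both comparisons can be sketched with integer bit masks. The function names
and the helper mask are hypothetical. The older thread uses the positions
listed for thread 5 in the Figure 3A example; the first younger thread is an
illustrative one that has only read and has the load positions 2, 5, 6, and
the second uses the load position 4 listed for thread 7.

```python
def mask(*positions):
    """Build an integer bit mask with the given bit positions set."""
    value = 0
    for position in positions:
        value |= 1 << position
    return value


def collides_separate(old_store, young_load, young_store):
    # old store vector AND (young store vector OR young load vector)
    return (old_store & (young_store | young_load)) != 0


def collides_combined(old_combined, young_combined):
    # AND between the two combined load and store vectors
    return (old_combined & young_combined) != 0


old_load, old_store = mask(0, 1, 5), mask(1, 4)
old_combined = old_load | old_store          # bitwise inclusive OR

reader = mask(2, 5, 6)                       # younger thread, loads only
sep = collides_separate(old_store, reader, 0)
comb = collides_combined(old_combined, reader)

reader_7 = mask(4)                           # load position of thread 7
sep_7 = collides_separate(old_store, reader_7, 0)

print(sep, comb, sep_7)  # prints False True True
```

The combined check reports a collision at position 5 although both threads
only read there, which illustrates the increased number of spurious collisions
with combined vectors. The collision at position 4 is spurious as well, since
it stems from a write to m4 and a read of m12 sharing one bit under the
assumed vector size of 8.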

In an alternative embodiment the comparison to detect collisions is carried out
by performing the following logical operation bitwise on the bit vectors:

old store vector AND young load vector.

This comparison assumes that the threads are committed in program order and
that when a write operation that only modifies part of a memory element (which
corresponds to a read-modify-write operation) is carried out the corresponding
bit in both the load and the store vector is set.
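
A sketch of this alternative check, again with integer bit masks, is given
below. The function names are hypothetical. The example contrasts a younger
thread that overwrites a whole memory element with one that modifies only part
of it.

```python
def collides_alternative(old_store, young_load):
    # old store vector AND young load vector
    return (old_store & young_load) != 0


def mark_full_write(load, store, position):
    """A write of the whole element only sets the store bit."""
    return load, store | (1 << position)


def mark_partial_write(load, store, position):
    """A partial write is a read-modify-write: set both load and store bits."""
    bit = 1 << position
    return load | bit, store | bit


old_store = 1 << 3                                    # older thread wrote bit 3

full_load, full_store = mark_full_write(0, 0, 3)      # younger full write
part_load, part_store = mark_partial_write(0, 0, 3)   # younger partial write

print(collides_alternative(old_store, full_load),
      collides_alternative(old_store, part_load))
# prints False True
```

The full write is not reported because, with commits in program order, the
younger thread's value simply overwrites the older one. The partial write is
reported because it depends on the previous contents of the element.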

An advantage of the collision check of the present invention is that since
collisions do not have to be detected until the thread has finished executing,
there is no need for any locking mechanism or memory barriers during
execution. This reduces the execution overhead and makes the implementation
simpler. Another reason why the execution overhead can be reduced according
to the present invention is that if the collision check is only performed when the
thread has finished executing, at most one check will have to be made for each
accessed memory element, even if the element has been accessed many times
during execution. In the known mechanisms discussed above a collision check
was performed in connection with each access to the shared memory.

The cost of handling collisions according to the present invention is that
collisions are not detected as early as possible, which results in some wasted data
processing of threads that already have collided and should be rolled back.
However, the gain in execution overhead will in many cases surpass the cost of
not detecting collisions immediately. The collision check of the present
invention described above is thus particularly favorable when collisions are rare.

According to the present invention, the only thing that has to be performed in
the same order as in the original single-threaded program is the collision check.
Threads can be executed and rolled back out of program order and, depending
on the implementation, sometimes also committed out of program order.

If the many-to-one correspondence between the memory elements and the bits
in the load and store vectors is used, the load and store vectors can have a fixed
size. The memory overhead is then proportional to the number of threads
instead of the number of memory elements, which means that the amount of
memory needed to store the data structures will remain the same when the
number of memory elements in the shared memory increases.

The present invention can be implemented both in hardware and in software. In
a hardware implementation it is possible to use a fast fixed-size memory inside
each processor to store the data structures. In a software implementation a
speed advantage will be obtained if the data structures are made small enough to
be stored in the first level cache of the processor. Due to the frequent use of the
data structures it will be advantageous to store them in as fast memory as
possible.

The data structure associated with a thread will naturally only have to be stored
in memory until the thread with which it is associated and all threads that may
depend on the thread are committed. Once the thread and all threads that may
depend on it are committed the memory used to store its associated data
structure can be reused.

The present invention is not limited to any particular type of memory elements
of a shared memory. The present invention is applicable to both logical and
physical memory elements. Logical memory elements are for example variables,
vectors, structures and objects in an object oriented language. Physical memory
elements are for example bytes, words, cache lines, memory pages and memory
segments.

As described above a thread comprises a number of program instructions. Other
terms for a series of instructions are sometimes used in the field. An
example of such a term is job.

Thread-level speculative execution with a shared memory has many similarities
to a database transaction system. The entries of a database can be compared
with the elements of a shared memory and since a database transaction includes
a number of operations, a database transaction can be compared with a thread.
One way to ensure that a database remains consistent is to check for collisions
between different database transactions. Thus the principles of the ideas of the
present invention may be used also in this field.

It is to be understood that the embodiments of the present invention discussed
above and illustrated in the figures merely serve as examples to illustrate the
ideas of the present invention and that the invention in no way is limited to just
the examples described. The examples are for instance simple examples that
only illustrate a few memory elements in the shared memory and a few bits in
the data structures associated with the threads. In reality the number of memory
elements and bits can be very large. The present invention is further not limited
to any particular number of threads or CPUs.

CLAIMS

1. An apparatus that supports execution of computer program instructions
speculatively out of program order comprising: 

- a plurality of threads for executing computer program instructions, and 

- a shared memory, which comprises a number of memory elements accessible 
to the plurality of threads; 

wherein each of the threads is associated with a data structure for storing
information regarding accesses to the memory elements of the shared memory
and wherein each of the threads has means for accessing a selected memory
element in the shared memory and means for storing information in the
associated data structure indicative of the access to the selected memory
element.

2. The apparatus according to claim 1, wherein the data structures are one of the
following types of structures: an unsorted list, a sorted list, a tree and a table.

3. The apparatus according to claim 1, wherein each data structure comprises a
number of bits that correspond to the memory elements of the shared memory
and wherein the means for storing information are means for setting at least one
chosen bit, which at least one chosen bit corresponds to the selected memory
element.

4. The apparatus according to claim 3, wherein the data structure comprises a
load vector and a store vector, wherein the means for setting at least one chosen
bit is arranged to set a bit in the load vector when the first thread accesses the
selected memory element in order to read it, and wherein the means for setting
at least one chosen bit is arranged to set a bit in the store vector when the first
thread accesses the selected memory element in order to write to it.



5. The apparatus according to claim 3, wherein the data structure comprises a
single combined load and store vector.

6. The apparatus according to claim 4 or 5, wherein there is a one-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

7. The apparatus according to claim 4 or 5, wherein there is a many-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

8. The apparatus according to claim 7, wherein the correspondence between the
bits in the or each vector and the memory elements is determined by a hash
function that maps the memory elements to the bits in the or each vector.

9. The apparatus according to any of claims 1-8, wherein the apparatus further
comprises means for checking whether a thread has a private copy of the
selected memory object, means for creating a private copy of the selected
memory object and means for reading and writing to a private copy of the
selected memory object.

10. The apparatus according to any of claims 1-8, wherein the apparatus further
comprises means for storing a backup copy of the selected memory element.

11. The apparatus according to any of claims 1-10, wherein the apparatus
further comprises means for checking, when a first thread has finished
execution, if each of the threads on which the first thread may depend is ready
to be committed and means for checking for collision between the first thread
and each of the threads on which the first thread may depend, which means for
checking comprises means for comparing the data structure associated with the
first thread with each respective data structure associated with the threads on
which the first thread may depend.

12. The apparatus according to claim 11, wherein the apparatus further
comprises means for creating a dependence list associated with the first thread
before execution of the first thread, which dependence list includes a reference
to each thread which has not yet been committed and which comes before the
first thread in program order.


13. The apparatus according to claim 11 or 12, wherein the apparatus further
comprises means for committing the first thread if no collision is detected
between the first thread and any of the threads on which the first thread may
depend and means for restarting execution of the first thread if a collision is
detected between the first thread and any of the threads on which the first
thread may depend.

14. The apparatus according to claim 13, wherein the apparatus further
comprises means for delaying a restart of execution of the first thread until the
thread or each of the threads with which the first thread has collided has been
committed.

15. The apparatus according to claim 14, wherein the apparatus further
comprises means for giving priority to committing and/or executing the thread
or each of the threads with which the first thread has collided.

16. The apparatus according to any of claims 11-15, wherein the apparatus
further comprises means for performing a partial check for collisions between
the first thread and at least one of the threads on which the first thread may
depend, which means for performing a partial check comprises means for
comparing the data structure associated with the first thread with the respective
data structure associated with the at least one of the threads on which the first
thread may depend.

17. A method for recording information regarding accesses to a shared
memory, which shared memory is accessible to a plurality of threads that are
arranged to execute computer program instructions speculatively out of program
order, which method includes the steps of:

- a first of the plurality of threads accessing a selected memory element in the
shared memory, and

- the first thread storing information indicative of the access to the selected
memory element in a data structure associated with the first thread.

18. The method according to claim 17, wherein the data structure is one of the
following types of structures: an unsorted list, a sorted list, a tree and a table.

19. The method according to claim 17, wherein each data structure comprises a
number of bits that correspond to the memory elements of the shared memory
and wherein the step of storing information comprises setting a chosen bit in the
data structure, which chosen bit corresponds to the selected memory element.

20. The method according to claim 19, wherein the data structure comprises a
load vector and a store vector, wherein the chosen bit is a bit in the load vector
if the first thread accesses the selected memory element in order to read it, and
wherein the chosen bit is a bit in the store vector if the first thread accesses the
selected memory element in order to write to it.

21. The method according to claim 19, wherein the data structure comprises a
single combined load and store vector.

22. The method according to claim 20 or 21, wherein there is a one-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

23. The method according to claim 20 or 21, wherein there is a many-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

24. The method according to claim 23, wherein the correspondence between the 
bits in the or each vector and the memory elements is determined by means of 
mapping the memory elements to the bits in the or each vector using a hash 
function. 

25. The method according to any of claims 17-24, comprising the further steps
of:

- the first thread checking whether it has a private copy of the selected memory
object;

- if the first thread has a private copy and the first thread accesses the selected
memory element in order to read it, the first thread reading from the private
copy;

- if the first thread does not have a private copy and the first thread accesses
the selected memory element in order to read it, the first thread reading from
the selected memory element in the shared memory;

- if the first thread has a private copy and the first thread accesses the selected
memory element in order to write to it, the first thread writing to the private
copy; and

- if the first thread does not have a private copy and the first thread accesses
the selected memory element in order to write to it, the first thread creating a
private copy of the selected memory and writing to the private copy.

26. The method according to any of claims 17-24, comprising the further steps
of, if the first thread accesses the selected memory element in order to write to
it, the first thread storing a backup copy of the selected memory element and the
first thread writing to the selected memory element in the shared memory after
the backup copy is stored.

27. A method for handling possible collisions between a plurality of threads,
which threads are arranged to execute computer program instructions
speculatively out of program order and to access memory elements of a shared
memory, which method includes the steps of:

executing a first thread;

checking, when the first thread has finished execution, if each of the threads on
which the first thread may depend is ready to be committed;

waiting until each of the threads on which the first thread may depend is ready
to be committed, if each of the threads on which the first thread may depend is
not ready to be committed; and

checking for collision between the first thread and each of the threads on which
the first thread may depend by means of comparing a data structure associated
with the first thread with a data structure associated with the thread on which
the first thread may depend, which data structures store information regarding
which of the memory elements the thread with which the data structure is
associated has accessed during execution of the thread.

28. The method according to claim 27, wherein each data structure comprises a
number of bits that correspond to the memory elements of the shared memory
and wherein a bit is set if the memory element to which the bit corresponds has
been accessed by the thread with which the data structure is associated during
execution of the thread.

29. The method according to claim 28, wherein each data structure comprises a
load vector and a store vector, wherein a bit in the load vector is set if the
memory object to which the bit corresponds has been read by the thread with
which the data structure is associated during execution of the thread and
wherein a bit in the store vector is set if the memory object to which the bit
corresponds has been written to by the thread with which the data structure is
associated during execution of the thread.

30. The method according to claim 28, wherein each data structure comprises a
single combined load and store vector.

31. The method according to any of claims 27-30, wherein the method further
comprises the step of creating a dependence list associated with the first thread
before execution of the first thread, which dependence list includes a reference
to each thread which has not yet been committed and which comes before the
first thread in program order.

32. The method according to any of claims 27-31, wherein the first thread is
committed if no collision is detected and wherein the execution of the first
thread is restarted if a collision is detected.

33. The method according to claim 32, wherein the restart of execution of the
first thread is delayed until the thread or each of the threads with which the first
thread collided has been committed.

34. The method according to claim 33, wherein priority is given to committing
and/or executing the thread or each of the threads with which the first thread
collided.

35. The method according to any of claims 27-34, comprising the further step of
performing a partial check for collisions between the first thread and at least one
of the threads on which the first thread may depend by means of comparing the
data structure associated with the first thread with the respective data structure
associated with the at least one of the threads on which the first thread may
depend, wherein no locking of the data structures takes place while the partial
check is performed.

36. A computer program product comprising computer code means for
performing the method of any of claims 17-26 when run on a computer.

37. A computer program product comprising computer code means for
performing the method of any of claims 27-35 when run on a computer.